Skip to content

xet-data: drop sha2/asm to fix v1-resolver Windows breakage#833

Closed
assafvayner wants to merge 1 commit into
mainfrom
assaf/fix-sha2-asm-windows-leak
Closed

xet-data: drop sha2/asm to fix v1-resolver Windows breakage#833
assafvayner wants to merge 1 commit into
mainfrom
assaf/fix-sha2-asm-windows-leak

Conversation

@assafvayner
Copy link
Copy Markdown
Contributor

@assafvayner assafvayner commented May 7, 2026

Summary

sha2-asm is explicitly unsupported on Windows: its lib.rs has a compile_error! on cfg(target_os = "windows"), and its build.rs feeds GAS-syntax .S files to the C compiler — on windows-msvc cl.exe doesn't understand them, so the link step fails with:

LINK : fatal error LNK1181: cannot open input file '...sha2-asm-.../out/...sha512_x64.o'

xet_data/Cargo.toml already declared sha2/asm only under cfg(not(target_os = "windows")) (introduced in #693), which looks correct on its face. But Cargo's v1 feature resolver does not respect target-conditional cfg(...) gates around feature activations — it unifies features globally across the resolve graph regardless of which target's deps are active. Crates on edition < 2021 with no explicit resolver = "2" (in their package or workspace Cargo.toml) use v1 by default.

The result: any v1-resolver crate transitively depending on xet-data (through hf-xethf-hub, etc.) still gets sha2/asm activated on Windows, and the build fails.

This was caught while bumping tokenizers (edition 2018, no resolver) to hf-hub 1.0.0-rc.1 — its windows-latest CI now fails on sha2-asm linking. See tokenizers#2047 build (windows-latest).

There is no way to fix the leak with a target-conditional gate — that's exactly what v1 ignores. Options are: drop the activation, or expose sha2/asm as an opt-in feature. This PR drops it.

The perf cost is small in practice: sha2 0.10 already runtime-detects SHA-NI on x86/x86_64 (and ARM Crypto Extensions on aarch64) via cpufeatures, which is an unconditional dep independent of the asm feature. SHA-NI is by far the dominant fast path on modern hardware. The sha2-asm fallback only helps on older x86 CPUs without SHA-NI, where the gap is modest.

The leftover wasm-only features = ["asm"] in wasm/hf_xet_wasm/Cargo.toml is left alone; it isn't reachable from the published crates' dep graphs (target_arch = wasm32, and that crate isn't a transitive dep of hf-xet).

Test plan

  • cargo check -p xet-data passes locally.
  • cargo tree -p xet-data --invert sha2-asm returns "did not match any packages" (i.e. sha2-asm is no longer in the tree).
  • CI: confirms no regression on existing platforms.
  • Validate downstream: re-run tokenizers#2047 build (windows-latest) against an hf-xet build that pulls this xet-data — expected to clear the linker error.

Note

Low Risk
Low risk: dependency/feature-flag change only; main impact is potentially reduced SHA-2 performance on older non-SHA-NI CPUs while improving Windows compatibility for downstream crates.

Overview
Removes target-conditional activation of sha2's asm feature in xet_data/Cargo.toml and replaces it with a single unconditional sha2 dependency.

This prevents sha2-asm from being pulled into downstream dependency graphs (notably under Cargo’s v1 feature resolver), avoiding Windows build/link failures while keeping sha2’s runtime CPU fast paths.

Reviewed by Cursor Bugbot for commit fabb24a. Bugbot is set up for automated code reviews on this repo. Configure here.

`sha2-asm` is explicitly unsupported on Windows (its `lib.rs` has a
`compile_error!` on `cfg(target_os = "windows")` and its build.rs feeds
GAS-syntax `.S` files to MSVC, which does not understand them).

`xet_data/Cargo.toml` previously declared `sha2/asm` only on
`cfg(not(target_os = "windows"))`. That looks correct, but Cargo's v1
feature resolver (used by crates on edition < 2021 with no explicit
`resolver = "2"`) does not respect target-conditional `cfg(...)` gates
around feature activations: it unifies features globally across the
resolve graph, so the `asm` feature still gets enabled on Windows for
downstream consumers using v1. The result: any v1-resolver crate
depending transitively on `xet-data` (e.g. `tokenizers` via
`hf-hub` -> `hf-xet`) fails to build on `windows-msvc` with:

  LINK : fatal error LNK1181: cannot open input file '...sha512_x64.o'

We can't fix this with a target-conditional gate; the only options are
to drop the `sha2/asm` activation or expose it as an opt-in feature.

Drop it. `sha2 0.10` already runtime-detects SHA-NI on x86/x86_64 and
ARM Crypto Extensions on aarch64 via the `cpufeatures` crate
(unconditional dep, independent of the `asm` feature), which is the
dominant fast path on modern hardware. The `sha2-asm` fallback only
helps on older x86 CPUs without SHA-NI, where the gain is small. The
loss is acceptable; the gain is fixing Windows for all downstream
consumers regardless of resolver version.
@assafvayner assafvayner requested review from hoytak and seanses May 7, 2026 21:17
@assafvayner
Copy link
Copy Markdown
Contributor Author

Closing — wrong layer for the fix. The target-conditional gate here is correct under Cargo's v2 feature resolver; the leak only happens for v1-resolver consumers (edition < 2021 with no resolver = "2" set). Verified that adding resolver = "2" to the consuming crate eliminates sha2-asm from the windows-msvc resolve. Will fix downstream instead.

@assafvayner assafvayner closed this May 7, 2026
@assafvayner assafvayner deleted the assaf/fix-sha2-asm-windows-leak branch May 7, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant